 generative method



TabEBM: A Tabular Data Augmentation Method with Distinct Class-Specific Energy-Based Models

Neural Information Processing Systems

Data collection is often difficult in critical fields such as medicine, physics, and chemistry, typically yielding only small tabular datasets. However, classification methods tend to struggle with these small datasets, leading to poor predictive performance. Increasing the training set with additional synthetic data, similar to data augmentation in images, is commonly believed to improve downstream tabular classification performance. However, current tabular generative methods that learn either the joint distribution $p(\mathbf{x}, y)$ or the class-conditional distribution $p(\mathbf{x} \mid y)$ often overfit on small datasets, resulting in poor-quality synthetic data that usually worsens classification performance compared to using real data alone. To solve these challenges, we introduce TabEBM, a novel class-conditional generative method using Energy-Based Models (EBMs). Unlike existing tabular methods that use a shared model to approximate all class-conditional densities, our key innovation is to create distinct EBM generative models for each class, each modelling its class-specific data distribution individually. This approach creates robust energy landscapes, even in ambiguous class distributions. Our experiments show that TabEBM generates synthetic data with higher quality and better statistical fidelity than existing methods. When used for data augmentation, our synthetic data consistently leads to improved classification performance across diverse datasets of various sizes, especially small ones.
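The idea of fitting one generative model per class, rather than a single shared model, can be illustrated with a minimal sketch. This is not TabEBM: an independent diagonal Gaussian per class stands in for the class-specific EBM, and the names `fit_per_class`, `sample`, and `augment` are hypothetical helpers introduced only for illustration.

```python
import random
import statistics
from collections import defaultdict


def fit_per_class(X, y):
    """Fit one independent model per class.

    Assumption: a diagonal Gaussian stands in for a class-specific EBM.
    """
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    models = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [statistics.fmean(c) for c in cols]
        # Floor the spread so constant features still produce valid samples.
        stds = [max(statistics.pstdev(c), 1e-6) for c in cols]
        models[label] = (means, stds)
    return models


def sample(models, label, n, rng):
    """Draw n synthetic rows from the model fitted to a single class."""
    means, stds = models[label]
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(n)]


def augment(X, y, n_per_class, seed=0):
    """Return the real data followed by n_per_class synthetic rows per class."""
    rng = random.Random(seed)
    models = fit_per_class(X, y)
    X_aug, y_aug = list(X), list(y)
    for label in models:
        X_aug.extend(sample(models, label, n_per_class, rng))
        y_aug.extend([label] * n_per_class)
    return X_aug, y_aug
```

Because each class has its own parameters, a poorly estimated class cannot distort the model of another class, which is the structural property the abstract attributes to distinct class-specific models.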




Fiaingen: A financial time series generative method matching real-world data quality

Rožanec, Jože M., Žezlin, Tina, Vasiliu, Laurentiu, Mladenić, Dunja, Prodan, Radu, Roman, Dumitru

arXiv.org Artificial Intelligence

Data is vital in enabling machine learning models to advance research and practical applications in finance, where accurate and robust models are essential for investment and trading decision-making. However, real-world data is limited despite its quantity, quality, and variety. The data shortage of various financial assets directly hinders the performance of machine learning models designed to trade and invest in these assets. Generative methods can mitigate this shortage. In this paper, we introduce a set of novel techniques for time series data generation (we name them Fiaingen) and assess their performance across three criteria: (a) overlap of real-world and synthetic data on a reduced dimensionality space, (b) performance on downstream machine learning tasks, and (c) runtime performance. Our experiments demonstrate that the methods achieve state-of-the-art performance across all three criteria. Synthetic data generated with Fiaingen methods more closely mirrors the original time series data while keeping data generation time close to seconds, ensuring the scalability of the proposed approach. Furthermore, models trained on the synthetic data achieve performance close to that of models trained on real-world data.


Self-Alignment Learning to Improve Myocardial Infarction Detection from Single-Lead ECG

Jin, Jiarui, Fang, Xiaocheng, Wang, Haoyu, Li, Jun, Liu, Che, Xie, Donglin, Li, Hongyan, Hong, Shenda

arXiv.org Artificial Intelligence

Myocardial infarction is a critical manifestation of coronary artery disease, yet detecting it from single-lead electrocardiogram (ECG) remains challenging due to limited spatial information. An intuitive idea is to convert single-lead into multiple-lead ECG for classification by pre-trained models, but generative methods optimized at the signal level in most cases leave a large latent space gap, ultimately degrading diagnostic performance. This naturally raises the question of whether latent space alignment could help. However, most prior ECG alignment methods focus on learning transformation invariance, which mismatches the goal of single-lead detection. To address this issue, we propose SelfMIS, a simple yet effective alignment learning framework to improve myocardial infarction detection from single-lead ECG. Discarding manual data augmentations, SelfMIS employs a self-cutting strategy to pair multiple-lead ECG with their corresponding single-lead segments and directly align them in the latent space. This design shifts the learning objective from pursuing transformation invariance to enriching the single-lead representation, explicitly driving the single-lead ECG encoder to learn a representation capable of inferring global cardiac context from the local signal. Experimentally, SelfMIS achieves superior performance over baseline models across nine myocardial infarction types while maintaining a simpler architecture and lower computational overhead, thereby substantiating the efficacy of direct latent space alignment. Our code and checkpoint will be publicly available after acceptance.


Improving Generative Methods for Causal Evaluation via Simulation-Based Inference

Amaranath, Pracheta, Muralikrishnan, Vinitra, Sharma, Amit, Jensen, David D.

arXiv.org Artificial Intelligence

Generating synthetic datasets that accurately reflect real-world observational data is critical for evaluating causal estimators, but remains a challenging task. Existing generative methods offer a solution by producing synthetic datasets anchored in the observed data (source data) while allowing variation in key parameters such as the treatment effect and amount of confounding bias. However, existing methods typically require users to provide point estimates of such parameters (rather than distributions) and fixed estimates (rather than estimates that can be improved with reference to the source data). This denies users the ability to express uncertainty over parameter values and removes the potential for posterior inference, potentially leading to unreliable estimator comparisons. We introduce simulation-based inference for causal evaluation (SBICE), a framework that models generative parameters as uncertain and infers their posterior distribution given a source dataset. Leveraging techniques in simulation-based inference, SBICE identifies parameter configurations that produce synthetic datasets closely aligned with the source data distribution. Empirical results demonstrate that SBICE improves the reliability of estimator evaluations by generating more realistic datasets, which supports a robust and data-consistent approach to causal benchmarking under uncertainty.
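Inferring a posterior over a generative parameter from a source dataset can be sketched with rejection ABC, one of the simplest simulation-based inference techniques. This is not the SBICE algorithm: the toy generator (alternating treatment assignment, unit Gaussian noise, no confounding), the difference-in-means summary, and the names `simulate`, `summary`, and `abc_posterior` are all assumptions made for illustration.

```python
import random
import statistics


def simulate(tau, n, rng):
    """Toy generator (assumption): outcome = tau * treatment + unit Gaussian noise."""
    data = []
    for i in range(n):
        t = i % 2  # alternate treatment so both arms are always populated
        data.append((t, tau * t + rng.gauss(0.0, 1.0)))
    return data


def summary(data):
    """Summary statistic: difference in mean outcome between treated and control."""
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def abc_posterior(source, prior_low, prior_high, draws, eps, seed=0):
    """Rejection ABC: keep prior draws whose simulated summary is within eps of the source's."""
    rng = random.Random(seed)
    target = summary(source)
    accepted = []
    for _ in range(draws):
        tau = rng.uniform(prior_low, prior_high)  # uniform prior over the effect
        simulated = simulate(tau, len(source), rng)
        if abs(summary(simulated) - target) <= eps:
            accepted.append(tau)
    return accepted


# Usage: the accepted values approximate a posterior over the treatment effect.
source = simulate(2.0, 100, random.Random(1))
posterior = abc_posterior(source, 0.0, 4.0, draws=1000, eps=0.2)
```

The returned list is a set of samples rather than a point estimate, which is the contrast the abstract draws with methods that require fixed parameter values.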


Building surrogate models using trajectories of agents trained by Reinforcement Learning

Cestero, Julen, Quartulli, Marco, Restelli, Marcello

arXiv.org Artificial Intelligence

Sample efficiency in the face of computationally expensive simulations is a common concern in surrogate modeling. Current strategies to minimize the number of samples needed are not as effective in simulated environments with wide state spaces. As a response to this challenge, we propose a novel method to efficiently sample simulated deterministic environments by using policies trained by Reinforcement Learning. We provide an extensive analysis of these surrogate-building strategies with respect to Latin-Hypercube sampling or Active Learning and Kriging, cross-validating performances with all sampled datasets. The analysis shows that a mixed dataset that includes samples acquired by random agents, expert agents, and agents trained to explore the regions of maximum entropy of the state transition distribution provides the best scores through all datasets, which is crucial for a meaningful state space representation. We conclude that the proposed method improves the state-of-the-art and clears the path to enable the application of surrogate-aided Reinforcement Learning policy optimization strategies on complex simulators.


MAVOS-DD: Multilingual Audio-Video Open-Set Deepfake Detection Benchmark

Croitoru, Florinel-Alin, Hondru, Vlad, Popescu, Marius, Ionescu, Radu Tudor, Khan, Fahad Shahbaz, Shah, Mubarak

arXiv.org Artificial Intelligence

We present the first large-scale open-set benchmark for multilingual audio-video deepfake detection. Our dataset comprises over 250 hours of real and fake videos across eight languages, with 60% of data being generated. For each language, the fake videos are generated with seven distinct deepfake generation models, selected based on the quality of the generated content. We organize the training, validation and test splits such that only a subset of the chosen generative models and languages are available during training, thus creating several challenging open-set evaluation setups. We perform experiments with various pre-trained and fine-tuned deepfake detectors proposed in recent literature. Our results show that state-of-the-art detectors are not currently able to maintain their performance levels when tested in our open-set scenarios. We publicly release our data and code at: https://huggingface.co/datasets/unibuc-cs/MAVOS-DD.
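An open-set split of this kind, where some generators and languages are held out of training, can be sketched as follows. The record fields `language` and `generator` and the function `open_set_split` are hypothetical and do not reflect the actual MAVOS-DD schema or split procedure.

```python
def open_set_split(samples, train_generators, train_languages):
    """Partition samples into a training pool and an open-set pool.

    A sample goes to the training pool only if both its language and its
    generator were selected for training; everything else is open-set.
    """
    train, open_set = [], []
    for s in samples:
        seen_language = s["language"] in train_languages
        # Real videos have no generator (None); treat them as seen on that axis.
        seen_generator = s["generator"] is None or s["generator"] in train_generators
        if seen_language and seen_generator:
            train.append(s)
        else:
            open_set.append(s)
    return train, open_set


# Usage with made-up records: one language and one generator are held out.
samples = [
    {"id": 1, "language": "en", "generator": None},
    {"id": 2, "language": "en", "generator": "gen_a"},
    {"id": 3, "language": "en", "generator": "gen_b"},
    {"id": 4, "language": "de", "generator": "gen_a"},
    {"id": 5, "language": "de", "generator": None},
]
train, open_set = open_set_split(samples, {"gen_a"}, {"en"})
```

Evaluating a detector separately on the open-set pool exposes performance drops on unseen generators or languages that an in-domain test set would hide.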